To address the forgetting and underutilization of textual information in image captioning methods, a Scene Graph-aware Cross-modal Network (SGC-Net) was proposed. Firstly, the scene graph of the image was used as its visual features, and a Graph Convolutional Network (GCN) was used for feature fusion, so that the visual and textual features lay in the same feature space. Then, the text sequence generated by the model was stored, and the corresponding position information was added to form the textual features of the image, so as to solve the problem of textual feature loss caused by the single-layer Long Short-Term Memory (LSTM) network. Finally, to address over-dependence on image information and underuse of textual information, the self-attention mechanism was used to extract the salient image and text information and fuse them. Experimental results on the Flickr30K and MS-COCO (MicroSoft Common Objects in COntext) datasets demonstrate that SGC-Net outperforms Sub-GC on the metrics BLEU1 (BiLingual Evaluation Understudy with 1-grams), BLEU4 (BiLingual Evaluation Understudy with 4-grams), METEOR (Metric for Evaluation of Translation with Explicit ORdering), ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and SPICE (Semantic Propositional Image Caption Evaluation), with improvements of 1.1, 0.9, 0.3, 0.7, 0.4 and 0.3, 0.1, 0.3, 0.5, 0.6 on the two datasets, respectively. These results indicate that the approach taken by SGC-Net effectively improves the model’s image captioning performance and the fluency of the generated descriptions.
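The three steps described above (graph convolution over scene-graph nodes, position-aware text features, and self-attention fusion) can be illustrated with a minimal NumPy sketch. It is not the SGC-Net implementation: the function names, the symmetric adjacency normalization, the sinusoidal position encoding, the projection-free attention, and the mean pooling are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def gcn_layer(node_feats, adj, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W).
    The symmetric normalization is an assumed, commonly used choice."""
    a_hat = adj + np.eye(adj.shape[0])            # add self-loops
    deg_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm_adj = a_hat * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]
    return np.maximum(norm_adj @ node_feats @ weight, 0.0)

def positional_encoding(seq_len, dim):
    """Sinusoidal position encoding (assumed; the abstract only says
    that position information is added to the stored text sequence)."""
    pos = np.arange(seq_len)[:, None]
    idx = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (idx // 2)) / dim)
    return np.where(idx % 2 == 0, np.sin(angle), np.cos(angle))

def self_attention(x):
    """Scaled dot-product self-attention without learned projections
    (a simplification for illustration)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

def fuse(node_feats, adj, weight, word_embs):
    """Project scene-graph nodes into the text feature space, add
    position information to the generated words, then attend jointly."""
    visual = gcn_layer(node_feats, adj, weight)                # (nodes, dim)
    textual = word_embs + positional_encoding(*word_embs.shape)  # (words, dim)
    joint = np.concatenate([visual, textual], axis=0)
    attended, weights = self_attention(joint)
    return attended.mean(axis=0), weights          # pooled context vector

# Toy inputs: 3 scene-graph nodes (4-d), 5 generated words (8-d).
rng = np.random.default_rng(0)
nodes = rng.normal(size=(3, 4))
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
w = rng.normal(size=(4, 8))
words = rng.normal(size=(5, 8))
fused, weights = fuse(nodes, adj, w, words)
```

Because the visual and textual rows are concatenated before attention, each row of `weights` shows how much a node or word attends to image versus text features, which is the kind of balance between the two modalities that the final step aims to achieve.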